Appendix A Performance on real-world based instances

Neural Information Processing Systems

We further evaluate SGBS+EAS on nine real-world based instance sets from [15]. Each instance set consists of 20 instances that have similar characteristics (i.e., they have been sampled from the same underlying distribution). To account for this new evaluation setting, we always perform 10 runs in parallel for EAS and SGBS+EAS. This improves the solution quality at the cost of only a slight increase in the required runtime. For SGBS+EAS we set (β, γ) = (35, 5), the learning rate α = 0.005, and λ = 0.05.


Appendix A Performance on real-world based instances

Neural Information Processing Systems

We further evaluate SGBS+EAS on nine real-world based instance sets from [15]. Each instance set consists of 20 instances that have similar characteristics (i.e., they have been sampled from the same underlying distribution). The instance sets differ significantly in terms of several structural properties, for example, the number of customers n and their position (e.g., clustered vs. random positions). A more detailed description of instance sets can be found in [15]. One major advantage of neural combinatorial optimization approaches over traditional handcrafted optimization methods is their ability to quickly learn customized heuristics for new problem settings.


Bandits with Anytime Knapsacks

arXiv.org Artificial Intelligence

We consider bandits with anytime knapsacks (BwAK), a novel version of the BwK problem where there is an anytime cost constraint instead of a total cost budget. This problem setting introduces additional complexities as it mandates adherence to the constraint throughout the decision-making process. We propose SUAK, an algorithm that utilizes upper confidence bounds to identify the optimal mixture of arms while maintaining a balance between exploration and exploitation. SUAK is an adaptive algorithm that strategically utilizes the available budget in each round in the decision-making process and skips a round when it is possible to violate the anytime cost constraint. In particular, SUAK slightly under-utilizes the available cost budget to reduce the need for skipping rounds. We show that SUAK attains the same problem-dependent regret upper bound of $O(K \log T)$ established in prior work under the simpler BwK framework. Finally, we provide simulations to verify the utility of SUAK in practical settings.
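The round-skipping idea can be illustrated with a small sketch. This is not SUAK itself: it has no confidence bounds and no arm selection, and it only shows one possible reading of "skip a round when the constraint could be violated". It assumes that the anytime constraint means the cumulative cost after round t must not exceed c * t, and that every per-round cost lies in [0, max_cost]. The names run_with_skips, budget_per_round, and max_cost are hypothetical and do not come from the abstract.

```python
def run_with_skips(costs, budget_per_round, max_cost=1.0):
    """Play or skip each round so that spent <= budget_per_round * t always holds.

    Assumption (not from the abstract): the anytime constraint is
    cumulative_cost(t) <= budget_per_round * t for every round t, and the
    cost of a round is unknown in advance but bounded by max_cost.
    """
    spent = 0.0
    decisions = []
    for t, cost in enumerate(costs, start=1):
        # Play only if even the worst-case cost keeps the constraint satisfied.
        if spent + max_cost <= budget_per_round * t:
            spent += cost
            decisions.append("play")
        else:
            decisions.append("skip")
    return decisions, spent


decisions, spent = run_with_skips([1.0, 1.0, 1.0, 1.0], budget_per_round=0.5)
```

With unit costs and a budget of 0.5 per round, this conservative rule plays every second round. Spending slightly below the budget, as the abstract describes, would leave more slack and make such skips less frequent.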


Constrained Meta Agnostic Reinforcement Learning

arXiv.org Artificial Intelligence

Meta-Reinforcement Learning (Meta-RL) aims to acquire meta-knowledge for quick adaptation to diverse tasks. However, applying these policies in real-world environments presents a significant challenge in balancing rapid adaptability with adherence to environmental constraints. Our novel approach, Constraint Model Agnostic Meta Learning (C-MAML), merges meta learning with constrained optimization to address this challenge. C-MAML enables rapid and efficient task adaptation by incorporating task-specific constraints directly into its meta-algorithm framework during the training phase. This fusion results in safer initial parameters for learning new tasks. We demonstrate the effectiveness of C-MAML in simulated locomotion with wheeled robot tasks of varying complexity, highlighting its practicality and robustness in dynamic environments.


Online Learning with Costly Features in Non-stationary Environments

arXiv.org Artificial Intelligence

Maximizing long-term rewards is the primary goal in sequential decision-making problems. The majority of existing methods assume that side information is freely available, enabling the learning agent to observe all features' states before making a decision. In real-world problems, however, collecting beneficial information is often costly. That implies that, besides individual arms' reward, learning the observations of the features' states is essential to improve the decision-making strategy. The problem is aggravated in a non-stationary environment where reward and cost distributions undergo abrupt changes over time. To address the aforementioned dual learning problem, we extend the contextual bandit setting and allow the agent to observe subsets of features' states. The objective is to maximize the long-term average gain, which is the difference between the accumulated rewards and the paid costs on average. Therefore, the agent faces a trade-off between minimizing the cost of information acquisition and possibly improving the decision-making process using the obtained information. To this end, we develop an algorithm that guarantees a sublinear regret in time. Numerical results demonstrate the superiority of our proposed policy in a real-world scenario.


Differentiable Divergences Between Time Series

arXiv.org Machine Learning

Computing the discrepancy between time series of variable sizes is notoriously challenging. While dynamic time warping (DTW) is popularly used for this purpose, it is not differentiable everywhere and is known to lead to bad local optima when used as a "loss". Soft-DTW addresses these issues, but it is not a positive definite divergence: due to the bias introduced by entropic regularization, it can be negative and it is not minimized when the time series are equal. We propose in this paper a new divergence, dubbed soft-DTW divergence, which aims to correct these issues. We study its properties; in particular, under conditions on the ground cost, we show that it is non-negative and minimized when the time series are equal. We also propose a new "sharp" variant by further removing entropic bias. We showcase our divergences on time series averaging and demonstrate significant accuracy improvements compared to both DTW and soft-DTW on 84 time series classification datasets.
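A minimal sketch of the debiasing idea follows. The abstract does not spell out the definition, so the code assumes the usual form D(x, y) = SDTW(x, y) - 0.5 * SDTW(x, x) - 0.5 * SDTW(y, y) with a squared Euclidean ground cost. The function names and the plain O(nm) dynamic program are illustrative choices rather than the authors' implementation, and the "sharp" variant is not covered.

```python
import numpy as np


def soft_min(values, gamma):
    """Smoothed minimum -gamma * log(sum(exp(-v / gamma))), computed stably."""
    v = -np.asarray(values, dtype=float) / gamma
    m = v.max()
    return -gamma * (m + np.log(np.exp(v - m).sum()))


def soft_dtw(x, y, gamma=1.0):
    """Soft-DTW value between two series using a squared Euclidean cost."""
    x = np.asarray(x, dtype=float).reshape(len(x), -1)
    y = np.asarray(y, dtype=float).reshape(len(y), -1)
    n, m = len(x), len(y)
    # Pairwise squared Euclidean distances between all time steps.
    cost = ((x[:, None, :] - y[None, :, :]) ** 2).sum(axis=2)
    r = np.full((n + 1, m + 1), np.inf)
    r[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            r[i, j] = cost[i - 1, j - 1] + soft_min(
                [r[i - 1, j], r[i, j - 1], r[i - 1, j - 1]], gamma
            )
    return r[n, m]


def soft_dtw_divergence(x, y, gamma=1.0):
    """Assumed debiased form: subtract the average of the two self-terms."""
    return soft_dtw(x, y, gamma) - 0.5 * (
        soft_dtw(x, x, gamma) + soft_dtw(y, y, gamma)
    )


x = [0.0, 1.0, 2.0]
y = [0.0, 1.0, 2.0, 3.0]
self_value = soft_dtw(x, x)            # negative: the entropic bias
self_divergence = soft_dtw_divergence(x, x)  # zero by construction
cross_divergence = soft_dtw_divergence(x, y)
```

The self-term soft_dtw(x, x) is strictly negative for any series with more than one point, which is the bias the abstract refers to. Subtracting the self-terms makes the divergence of a series with itself exactly zero.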